Authorship Identification for Heterogeneous Documents

Authors

Yuta Tsuboi

Yuji Matsumoto

Abstract

The study of authorship identification in Japanese has for the most part been restricted to literary texts using basic statistical methods. In the present study, authors of mailing list messages are identified using a machine learning technique (Support Vector Machines). In addition, the classifier trained on the mailing list data is applied to identify the author of Web documents in order to investigate performance in authorship identification for more heterogeneous documents. Experimental results show better identification performance when we use the features of not only conventional word N-gram information but also of frequent sequential patterns extracted by a data mining technique (PrefixSpan).
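
The two feature types named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `word_ngrams` and `prefixspan`, the `min_support` and `max_len` parameters, and the toy sentences are assumptions made for illustration, and the miner is a simplified PrefixSpan-style search over single-word items that allows gaps between words. In a full pipeline, the resulting n-grams and patterns would presumably be turned into feature vectors for an SVM classifier, which is omitted here.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Contiguous word n-grams of a token list (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prefixspan(sequences, min_support, max_len=3):
    """Simplified PrefixSpan-style miner (illustrative assumption).

    Returns {pattern: support} for every word pattern that occurs as a
    subsequence (gaps allowed) in at least `min_support` sequences.
    """
    results = {}

    def mine(prefix, projected):
        # projected: (sequence index, start position) pairs describing
        # the suffixes that remain after matching `prefix`.
        if len(prefix) >= max_len:
            return
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Count each item once per sequence and keep the
                    # suffix after its first occurrence.
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, next_projected in occurrences.items():
            if len(next_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(next_projected)
                mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy tokenized sentences (made up for this example).
sentences = [
    ["i", "think", "so"],
    ["i", "do", "think"],
    ["you", "think", "so"],
]
bigrams = word_ngrams(sentences[0], 2)
patterns = prefixspan(sentences, min_support=2)
```

Unlike the contiguous bigrams, a mined pattern such as `("i", "think")` also matches the second sentence, where another word sits between the two items.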

Similar resources

Local n-grams for Author Identification Notebook for PAN at CLEF 2013

Our approach to the author identification task uses existing authorship attribution methods using local n-grams (LNG) and performs a weighted ensemble. This approach came in third for this year’s competition, using a relatively simple scheme of weights by training set accuracy. LNG models create profiles, consisting of a list of character n-grams that best represent a particular author’s writin...

A Framework for Authorship Identification in the Internet Environment

Misuse of anonymous online communication for illegal purposes has become a major concern [2,12]. In this paper, we present a framework named ART (Authorship Recognition Tool), that is designed to minimize manual procedures and maximize the efficiency of authorship identification based on the content of Internet electronic documents. The framework covers the phases of document retrieval and data...

The Keyboard Dilemma and Authorship Identification

The keyboard dilemma is the problem of identifying the authorship of a document that was produced by a computer to which multiple users had access. This paper describes a systematic methodology for authorship identification. Validation testing of the methodology demonstrated 95% cross validated accuracy in identifying documents from ten authors and 85% cross validated accuracy in identifying fi...

Co-authorship network analysis and social network indicators of coronavirus research

Background and aim: The aim of this study was to examine the status of documents related to coronavirus based on scientometric indicators and to draw a co-authorship map of the authors, organizations and countries producing articles, in order to understand this field as fully as possible. Materials and methods: This applied scientometric study was conducted using social network analysis. The statistical populati...

Authorship Verification Using the Impostors Method Notebook for PAN at CLEF 2013

This paper describes the evaluation of the GenIM method, which participated in the PAN'13 authorship identification competition. The approach is based on comparing the similarity between the given documents and a number of external (impostor) documents, so that documents can be classified as having been written by the same author if they are shown to be more similar to each other than to the ...

Publication date: 2002
